5 Qualitative Comparative Analysis (QCA) - setup

5.1 Original Data-set Description

The data-set for this analysis comes from an ESGeo draft project for ESG score forecasting. As previously mentioned, it comes without a codebook, so the description below is essentially a list of the variables it contains.

set of Xs

Variable Name	Short description
country	company country
continent	company continent
sector	company sector
tot_rev_19	total revenues of the company in 2019
P/E-19	price-to-earnings ratio in 2019: the company’s price per share divided by its earnings per share. It helps investors judge whether a stock is over- or undervalued
industry_name	TRBC industry name
ind_group_name	TRBC industry group name
business_sect_name	TRBC business sector name
econ_sect_name	TRBC economic sector name

TRBC stands for The Refinitiv Business Classification. It is a detailed sector and industry classification with five levels of granularity:

1. 13 Economic Sectors

2. 33 Business Sectors

3. 62 Industry Groups

4. 154 Industries

5. 898 Activities

For more information on the TRBC Sector Classification check: https://www.refinitiv.com/en/financial-data/indices/trbc-business-classification

For this analysis, we work at the first level of granularity, the Economic Sector. When building the new data-set, each nominal variable has to be recoded into as many dichotomous variables as it has categories, so a finer level would overload the data.

set of Ys

Variable Name	Short description
HR_score	Human Resources score
ENV_score	Environmental score
BUSBEHAV_score	Business Behavior score
COMINV_score	Community Involvement score
HRts_score	Human Rights score
OVERALL_score	Overall ESG score

All of these Ys contribute to the OVERALL_score variable, but the weight of each component is not available. In this analysis we consider only the Overall ESG score as our Y, both because it is the final score of a company’s ESG positioning and because it avoids overloading the new data-set we are going to create.

5.2 Data cleaning and management

5.2.1 Libraries

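library(RcmdrMisc)   # provides readXL(), used below to load the data
library(tidyverse)   # also attaches dplyr and ggplot2
library(dplyr)
library(hrbrthemes)  # provides theme_ipsum()
library(ggplot2)
library(QCA)
library(QCAtools)
library(ggpubr)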

5.2.2 Data-set Cleaning

Load the original data-set and name it “datawhole”:

datawhole<-readXL("datawhole.xlsx", rownames = FALSE, header = TRUE, na = "", sheet = 1, 
              stringsAsFactors = FALSE)

5.2.3 Dropping NAs

NAs are dropped in two steps.

The first is straightforward, as shown in the following code:

datawhole<-drop_na(datawhole)

The second step drops the NAs that are coded in the data-set as “Nonspecified sector” or “Unable to collect data for the field ‘TR.TRBCIndustry’ and some specific identifier(s).”:

datawhole<-datawhole[!(datawhole$ind_group_name=="Nonspecified sector"| datawhole$ind_group_name=="Unable to collect data for the field 'TR.TRBCIndustry' and some specific identifier(s)."),]            
datawhole<-datawhole[!(datawhole$business_sect_name=="Nonspecified sector"| datawhole$business_sect_name=="Unable to collect data for the field 'TR.TRBCIndustry' and some specific identifier(s)."),]            
datawhole<-datawhole[!(datawhole$econ_sect_name=="Nonspecified sector"| datawhole$econ_sect_name=="Unable to collect data for the field 'TR.TRBCIndustry' and some specific identifier(s)."),]            
datawhole<-datawhole[!(datawhole$industry_name=="Nonspecified sector"| datawhole$industry_name=="Unable to collect data for the field 'TR.TRBCIndustry' and some specific identifier(s)."),]
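The four filters share the same two placeholder labels, so the same idea can be written once and applied to all four TRBC columns. The following is only a minimal sketch on a toy data frame: the object `toy` and its values are invented for illustration, and only the column names and the two labels come from the text.

```r
# Placeholder labels that stand in for missing TRBC information
na_labels <- c(
  "Nonspecified sector",
  "Unable to collect data for the field 'TR.TRBCIndustry' and some specific identifier(s)."
)

# Toy data (hypothetical values) with the four TRBC columns
toy <- data.frame(
  industry_name      = c("Banks", "Nonspecified sector", "Oil & Gas", "Software"),
  ind_group_name     = c("Banking Services", "Nonspecified sector", "Oil & Gas", na_labels[2]),
  business_sect_name = c("Banking & Investment Services", "Nonspecified sector",
                         "Energy - Fossil Fuels", "Software & IT Services"),
  econ_sect_name     = c("Financials", "Nonspecified sector", "Energy", "Technology"),
  stringsAsFactors   = FALSE
)

trbc_cols <- c("industry_name", "ind_group_name", "business_sect_name", "econ_sect_name")

# Logical matrix: TRUE where a cell holds one of the placeholder labels
is_placeholder <- sapply(toy[trbc_cols], function(col) col %in% na_labels)

# Keep only the rows with no placeholder in any TRBC column
toy_clean <- toy[rowSums(is_placeholder) == 0, ]
nrow(toy_clean)
```

Unlike `==`, the `%in%` operator never returns NA, so this variant would also behave predictably if genuine NAs were still present in those columns.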

5.2.4 Dropping misclassification

While going through the Economic Sector variable, we noticed a misclassification: the “Telecommunications Services” label is not an Economic Sector according to the TRBC but a Business Sector. The affected observations are dropped:

datawhole<-datawhole[!(datawhole$econ_sect_name=="Telecommunications Services"),]

5.2.5 Creating new variables

Once the NAs have been dropped, we create some new variables to prepare the data-set for the QCA we will perform later.

The first is the logarithm of 2019 revenues: LOG_REV_19. We do so to manage the great variability in the order of magnitude of the observations, which is due to outliers. An alternative would have been to truncate the outliers and keep only the observations in an intermediate range (which also happen to be the most numerous), but this would have meant losing potentially important information.

datawhole['log_rev_19'] <- log(datawhole$tot_rev_19)
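To illustrate why the logarithm helps with observations of very different magnitude, here is a small sketch on invented revenue figures (the vector `rev` is hypothetical and not taken from the data-set):

```r
# Hypothetical revenues spanning several orders of magnitude
rev <- c(2e5, 5e6, 3e8, 7e10)

# Spread on the raw scale: the largest value is 350000 times the smallest
raw_ratio <- max(rev) / min(rev)

# Spread on the log scale: the same gap shrinks to log(350000)
log_range <- log(max(rev)) - log(min(rev))

# log() is only defined for positive values, so it is worth counting
# non-positive revenues before transforming
non_positive <- sum(rev <= 0)
```

On the log scale the outliers stay in the sample but no longer dominate the range of the variable.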

The second variable captures the difference between 2019 and 2018 revenues, to make sense of the revenue variability of the companies in the sample. The initial idea was to first calculate the logarithm of 2018 revenues and then the delta between the two years. Unfortunately, these operations would have created NaNs, since the natural logarithm of a negative number is undefined. To bypass this problem, we created a variable named DELTAREV_100: the ratio between 2019 and 2018 revenues, multiplied by 100. In this way our natural 0 is moved to 100: a value of 100 means that the company’s revenues neither increased nor decreased between 2018 and 2019, a value above 100 means that revenues increased in 2019, and a value below 100 means that they decreased. The only adjustment forced on the data-set is dropping the single observation whose 2018 revenues are 0, since the ratio would otherwise involve a division by zero.

# Drop the observation whose 2018 revenues are 0
datawhole<-datawhole[!(datawhole$tot_rev_18==0),]

# Create the new variable as the ratio between 2019 and 2018 revenues, multiplied by 100
datawhole['deltarev_100'] <- (datawhole$tot_rev_19 / datawhole$tot_rev_18)*100
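As a worked example of the reasoning above, the sketch below compares the logarithm of a plain difference with the ratio scaled by 100. The three pairs of revenues are invented for illustration.

```r
# Hypothetical 2018 and 2019 revenues for three companies
rev_18 <- c(200, 150, 80)
rev_19 <- c(200, 120, 100)

# The plain difference can be zero or negative, so its logarithm
# is -Inf or NaN for some companies
delta     <- rev_19 - rev_18
log_delta <- suppressWarnings(log(delta))

# The ratio scaled by 100 is defined whenever rev_18 is not 0:
# 100 = unchanged, below 100 = decrease, above 100 = increase
deltarev_100 <- (rev_19 / rev_18) * 100
```

Here the second company, whose revenues fell from 150 to 120, has no usable log-difference but a well-defined ratio of 80.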

From the variable econ_sect_name, we want to create as many variables as there are sectors in this data-set. In each of these 9 variables, the sector in question is coded as 1 and all other observations as 0. This is a first step of categorization for the QCA we will perform later.

datawhole['energy_1'] <- ifelse(datawhole$econ_sect_name == "Energy", 1,0)
datawhole['bscmaterials_1'] <- ifelse(datawhole$econ_sect_name == "Basic Materials", 1,0)
datawhole['industrials_1'] <- ifelse(datawhole$econ_sect_name == "Industrials", 1,0)
datawhole['conscycl_1'] <- ifelse(datawhole$econ_sect_name == "Consumer Cyclicals", 1,0)
datawhole['consnoncycl_1'] <- ifelse(datawhole$econ_sect_name == "Consumer NonCyclicals", 1,0)
datawhole['financials_1'] <- ifelse(datawhole$econ_sect_name == "Financials", 1,0)
datawhole['healthcare_1'] <- ifelse(datawhole$econ_sect_name == "Healthcare", 1,0)
datawhole['technology_1'] <- ifelse(datawhole$econ_sect_name == "Technology", 1,0)
datawhole['utilities_1'] <- ifelse(datawhole$econ_sect_name == "Utilities", 1,0)

We apply the same rationale to the variable continent and create as many variables as the geographical areas it contains.

datawhole['asia_1'] <- ifelse(datawhole$continent == "Asia Pacific", 1,0)
datawhole['emergmrkt_1'] <- ifelse(datawhole$continent == "Emerging Markets", 1,0)
datawhole['europe_1'] <- ifelse(datawhole$continent == "Europe", 1,0)
datawhole['mddleastafrica_1'] <- ifelse(datawhole$continent == "Middle East Africa", 1,0)
datawhole['nrtamerica_1'] <- ifelse(datawhole$continent == "North America", 1,0)
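Since every company belongs to exactly one sector and one geographical area, the dummies of each family should add up to 1 on every row. The sketch below shows one way to check this on toy data. The helper `make_dummies()` is hypothetical, written only for this illustration, and is not part of the original script.

```r
# Hypothetical helper: one 0/1 column per distinct value of x
make_dummies <- function(x) {
  levels_x <- sort(unique(x))
  out <- outer(x, levels_x, "==") * 1  # logical matrix converted to 0/1
  colnames(out) <- levels_x
  out
}

# Toy stand-in for datawhole$continent
continent <- c("Europe", "Asia Pacific", "Europe", "North America")
dummies <- make_dummies(continent)

# Every row should carry exactly one 1
all(rowSums(dummies) == 1)
```

A row summing to 0 would point to a category label that no dummy matches, for example because of a spelling difference between the data and the code.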

5.3 New Data-set Creation: data

At this point, we are ready to create a new data-set containing only the variables that will be useful for the analysis: all the newly created variables described above and the y of the data-set, OVERALL_SCORE, the overall ESG score of the companies.

data<- data.frame(datawhole$continent, datawhole$asia_1, datawhole$emergmrkt_1,
                  datawhole$europe_1, datawhole$mddleastafrica_1, datawhole$nrtamerica_1,
                  datawhole$log_rev_19, datawhole$deltarev_100,
                  datawhole$econ_sect_name, datawhole$energy_1,
                  datawhole$bscmaterials_1, datawhole$industrials_1, datawhole$conscycl_1,
                  datawhole$consnoncycl_1, datawhole$financials_1, datawhole$healthcare_1,
                  datawhole$technology_1, datawhole$utilities_1, datawhole$OVERALL_score)

data<- data %>% dplyr::rename(
  "CONTINENT" = datawhole.continent,
  "ASIA_1" = datawhole.asia_1,
  "BRICS_1" = datawhole.emergmrkt_1,
  "EUROPE_1" = datawhole.europe_1,
  "MDDLEAST_1" = datawhole.mddleastafrica_1,
  "NRTAMERICA_1" = datawhole.nrtamerica_1,
  "LOG_REV_19" = datawhole.log_rev_19,
  "DELTAREV_100" = datawhole.deltarev_100,
  "ECON_SEC_NAME" = datawhole.econ_sect_name,
  "ENERGY_1" = datawhole.energy_1,
  "BSCMATERIALS_1" = datawhole.bscmaterials_1,
  "INDUSTRIALS_1" = datawhole.industrials_1,
  "CONSCYCL_1"  = datawhole.conscycl_1,
  "CONSNONCYCL_1" = datawhole.consnoncycl_1, 
  "FINANCIALS_1" = datawhole.financials_1, 
  "HEALTHCARE_1" = datawhole.healthcare_1,
  "TECHNOLOGY_1"  = datawhole.technology_1, 
  "UTILITIES_1"  = datawhole.utilities_1,
  "OVERALL_SCORE" = datawhole.OVERALL_score
)

5.3.1 data: Variables codebook

These variables have been created from the ones in the original data-set so that they can specifically serve the Qualitative Comparative Analysis (QCA) that we are going to run next. All the originally qualitative variables, namely the geographical area and the economic sector, have been split into as many dichotomous variables as there are categories in the original variable. In this sense, the geographical area variable in “datawhole” becomes 5 dichotomous variables in “data”.

Each of these 5 dichotomous geographical area variables takes the value:
- “1” if the area corresponds to the variable name
- “0” for all the other areas.

The same rationale applies to the economic sector name, now split into 9 dichotomous variables.

Each of these 9 dichotomous economic sector variables takes the value:
- “1” if the economic sector corresponds to the variable name
- “0” for all the other economic sectors.

For reference, the two original variables containing the geographical area and the economic sector, CONTINENT and ECON_SEC_NAME respectively, have been kept in the new data-set but will not be used in the analysis.

Moreover, the new data-set “data” contains the newly created variables:
- LOG_REV_19: the 2019 revenues converted into logarithms to manage the great variability in magnitude of the observations;
- DELTAREV_100: the delta between 2019 and 2018 revenues, calculated as the ratio between the two then multiplied by 100;

and the y of this analysis:
- OVERALL_SCORE: unchanged from the original data-set.

These last 3 variables are not dichotomous, meaning that they will have to be calibrated during the Qualitative Comparative Analysis.
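To give an idea of what calibrating a non-dichotomous variable involves, the sketch below turns a few invented DELTAREV_100 values into a crisp 0/1 condition. The cut-off of 100 is an assumption made purely for illustration, chosen because it is the "no change" point of the variable. It is not necessarily the threshold used in the analysis.

```r
# Hypothetical DELTAREV_100 values
deltarev_100 <- c(80, 100, 125, 97.5)

# Crisp calibration: 1 if revenues grew between 2018 and 2019, 0 otherwise.
# The cut-off of 100 is an assumption for illustration only.
revenue_growth <- as.numeric(deltarev_100 > 100)
revenue_growth
```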

Variable Name	Short description
CONTINENT	geographical area of each company in the data-set
ASIA_1	companies located in Asia Pacific
BRICS_1	companies in the Emerging Markets area: Brazil, Russia, India, China, and South Africa
EUROPE_1	companies located in Europe
MDDLEAST_1	companies located in the Middle East and Africa
NRTAMERICA_1	companies located in North America
LOG_REV_19	2019 revenues converted into logarithms
DELTAREV_100	delta between 2019 and 2018 revenues, calculated as the ratio between the two then multiplied by 100
ECON_SEC_NAME	economic sector of each company in the data-set
ENERGY_1	companies part of the Energy sector
BSCMATERIALS_1	companies part of the Basic Materials sector
INDUSTRIALS_1	companies part of the Industrials sector
CONSCYCL_1	companies part of the Consumer Cyclicals sector
CONSNONCYCL_1	companies part of the Consumer Non-Cyclicals sector
FINANCIALS_1	companies part of the Financials sector
HEALTHCARE_1	companies part of the Healthcare sector
TECHNOLOGY_1	companies part of the Technology sector
UTILITIES_1	companies part of the Utilities sector
OVERALL_SCORE	Overall ESG score

5.3.2 data: Descriptives

The new data-set called data has:

MIN overall ESG score	MAX overall ESG score	TOT observation number
6	73	3332
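Summary figures of this kind can be obtained with a few base R calls. The sketch below uses a toy data frame with invented scores in place of `data`. Only the column name OVERALL_SCORE comes from the text.

```r
# Toy stand-in for the data object (hypothetical scores)
toy <- data.frame(OVERALL_SCORE = c(6, 41, 73, 55))

# Minimum score, maximum score and number of observations
descriptives <- data.frame(
  min_score = min(toy$OVERALL_SCORE),
  max_score = max(toy$OVERALL_SCORE),
  n_obs     = nrow(toy)
)
descriptives
```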

5.3.3 data: Frequencies Visualizations

The frequency of each economic sector in our data-set data is visualized as follows:

graph_numsector <- data %>%
  ggplot(aes(x = ECON_SEC_NAME, color = ECON_SEC_NAME)) +
  # geom_bar() counts the rows per category, so no grouping or binwidth is needed
  geom_bar(fill = "white", alpha = 0.9) +
  ggtitle("Number of companies per sector") +
  scale_y_continuous(breaks = seq(0, 4000, by = 200)) +
  theme_ipsum() +
  theme(
    plot.title = element_text(size = 15)
  )

graph_numsector + xlab("Economic sector name")

The distribution across geographical areas is the following:

# datawhole has the same rows as data, so the counts are identical
graph_numcontinent <- datawhole %>%
  ggplot(aes(x = continent, color = continent)) +
  # geom_bar() counts the rows per category, so no grouping or binwidth is needed
  geom_bar(fill = "white", alpha = 0.9) +
  ggtitle("Number of companies per geographical area") +
  scale_y_continuous(breaks = seq(0, 4000, by = 200)) +
  theme_ipsum() +
  theme(
    plot.title = element_text(size = 15)
  )

graph_numcontinent + xlab("Geographical area")
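The counts behind bar charts of this kind can also be read as a plain frequency table, which is handy when exact numbers are needed. The sketch below uses an invented vector of sector labels standing in for `data$ECON_SEC_NAME`.

```r
# Hypothetical sector labels standing in for data$ECON_SEC_NAME
sectors <- c("Energy", "Financials", "Energy", "Utilities", "Financials", "Energy")

# Counts per category, sorted from most to least frequent
sector_counts <- sort(table(sectors), decreasing = TRUE)
sector_counts
```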